library(readr)
library(dplyr)    # provides %>% and mutate()
library(ggplot2)  # used for the plots further down
sneaker_data <- read_csv("StockX-Data-Contest-2019-3.csv")
## Rows: 99956 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Order Date, Brand, Sneaker Name, Sale Price, Retail Price, Release ...
## dbl (1): Shoe Size
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(sneaker_data)
## # A tibble: 6 × 8
## `Order Date` Brand `Sneaker Name` Sale …¹ Retai…² Relea…³ Shoe …⁴ Buyer…⁵
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 9/1/2017 Yeezy Adidas-Yeezy-Boost… $1,097 $220 9/24/2… 11 Califo…
## 2 9/1/2017 Yeezy Adidas-Yeezy-Boost… $685 $220 11/23/… 11 Califo…
## 3 9/1/2017 Yeezy Adidas-Yeezy-Boost… $690 $220 11/23/… 11 Califo…
## 4 9/1/2017 Yeezy Adidas-Yeezy-Boost… $1,075 $220 11/23/… 11.5 Kentuc…
## 5 9/1/2017 Yeezy Adidas-Yeezy-Boost… $828 $220 2/11/2… 11 Rhode …
## 6 9/1/2017 Yeezy Adidas-Yeezy-Boost… $798 $220 2/11/2… 8.5 Michig…
## # … with abbreviated variable names ¹`Sale Price`, ²`Retail Price`,
## # ³`Release Date`, ⁴`Shoe Size`, ⁵`Buyer Region`
The output shows a few formatting issues. The column types are wrong: the two price columns should be numeric and the two date columns should be dates, but all four were read as character. I will also rename the columns to remove the spaces so they are easier to work with.
colnames(sneaker_data) <- c('Order_Date',
'Brand',
'Shoe_Name',
'Resale_Price',
'Retail_Price',
'Release_Date',
'Shoe_Size',
'Buy_Region')
sneaker_data2 <- sneaker_data %>%
  mutate(
    # Dates arrive as month/day/year strings
    Order_Date   = as.Date(Order_Date, format = "%m/%d/%Y"),
    Release_Date = as.Date(Release_Date, format = "%m/%d/%Y"),
    # parse_number() strips the "$" and the thousands separator
    Resale_Price = parse_number(Resale_Price),
    Retail_Price = parse_number(Retail_Price),
    # Already read as a double; kept as an explicit safeguard
    Shoe_Size    = as.numeric(Shoe_Size)
  )
head(sneaker_data2)
## # A tibble: 6 × 8
## Order_Date Brand Shoe_Name Resal…¹ Retai…² Release_…³ Shoe_…⁴ Buy_R…⁵
## <date> <chr> <chr> <dbl> <dbl> <date> <dbl> <chr>
## 1 2017-09-01 Yeezy Adidas-Yeezy-Boos… 1097 220 2016-09-24 11 Califo…
## 2 2017-09-01 Yeezy Adidas-Yeezy-Boos… 685 220 2016-11-23 11 Califo…
## 3 2017-09-01 Yeezy Adidas-Yeezy-Boos… 690 220 2016-11-23 11 Califo…
## 4 2017-09-01 Yeezy Adidas-Yeezy-Boos… 1075 220 2016-11-23 11.5 Kentuc…
## 5 2017-09-01 Yeezy Adidas-Yeezy-Boos… 828 220 2017-02-11 11 Rhode …
## 6 2017-09-01 Yeezy Adidas-Yeezy-Boos… 798 220 2017-02-11 8.5 Michig…
## # … with abbreviated variable names ¹Resale_Price, ²Retail_Price,
## # ³Release_Date, ⁴Shoe_Size, ⁵Buy_Region
sum(is.na(sneaker_data2))
## [1] 0
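As a quick sanity check on the conversions above, the same parsing calls can be run on a few hand-written values. This is only a minimal sketch: the tibble `toy` and its values are made up for illustration, and `colSums(is.na(...))` is shown as a per-column alternative to the single overall count.

```r
library(readr)
library(dplyr)

# Hypothetical rows that mimic the raw StockX formatting
toy <- tibble(
  Order_Date   = c("9/1/2017", "12/25/2018"),
  Resale_Price = c("$1,097", "$685"),
  Retail_Price = c("$220", "$220")
)

toy_clean <- toy %>%
  mutate(
    Order_Date   = as.Date(Order_Date, format = "%m/%d/%Y"),
    Resale_Price = parse_number(Resale_Price),  # drops "$" and ","
    Retail_Price = parse_number(Retail_Price)
  )

# Missing values per column rather than one overall total
na_per_column <- colSums(is.na(toy_clean))
print(toy_clean)
print(na_per_column)
```

A per-column count is useful because a date string that does not match the format becomes `NA` without raising an error, and the column name shows where that happened.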
The column types are now correct, and there are no missing values in the data set. With the data cleaned, we can begin exploring it.
## `summarise()` has grouped output by 'Shoe_Name'. You can override using the
## `.groups` argument.
It is interesting to see how much aftermarket value some sneakers have. Generally speaking, most of the Nike Off-White shoes resell for double their retail value.
Here, the most popular sneaker is defined as the one with the highest average resale price. We will split the shoes by brand to see which brand has the more sought-after shoe.
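The code that builds the per-sneaker averages is not shown in this report, so the following is only a sketch of one way such a table could be produced. The data in `toy_sales` and the names `avg_resale` and `top_by_brand` are hypothetical; only the column names follow the cleaned data set. Setting `.groups` explicitly also avoids the `summarise()` message printed above.

```r
library(dplyr)

# Hypothetical sales; the values are invented for illustration
toy_sales <- tibble(
  Shoe_Name    = c("Shoe-A", "Shoe-A", "Shoe-B", "Shoe-C", "Shoe-C"),
  Brand        = c("Yeezy", "Yeezy", "Yeezy", "Off-White", "Off-White"),
  Retail_Price = c(220, 220, 200, 190, 190),
  Resale_Price = c(300, 340, 900, 1800, 1900)
)

# One row per sneaker with its average resale price
avg_resale <- toy_sales %>%
  group_by(Shoe_Name, Brand, Retail_Price) %>%
  summarise(Average_Resale_Price = mean(Resale_Price), .groups = "drop")

# Highest average resale price within each brand
top_by_brand <- avg_resale %>%
  group_by(Brand) %>%
  slice_max(Average_Resale_Price, n = 1) %>%
  ungroup()
print(top_by_brand)
```

Using `slice_max()` per brand returns both winners in one table, instead of filtering and sorting each brand separately.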
mostPopularNikeSneaker <- head(NikeSneakers,1) %>% print()
## # A tibble: 1 × 4
## # Groups: Shoe_Name [1]
## Shoe_Name Brand Retail_Price Average_Resal…¹
## <chr> <chr> <dbl> <dbl>
## 1 Air-Jordan-1-Retro-High-Off-White-White Off-White 190 1826.
## # … with abbreviated variable name ¹Average_Resale_Price
The most popular Nike Off-White sneaker is the Air Jordan 1 Retro High Off-White in the white colorway.
mostPopularYeezySneaker <- head(YeezySneakers,1) %>% print()
## # A tibble: 1 × 4
## # Groups: Shoe_Name [1]
## Shoe_Name Brand Retail_Price Average_Resale_Price
## <chr> <chr> <dbl> <dbl>
## 1 Adidas-Yeezy-Boost-350-Low-Turtledove Yeezy 200 1532.
The most popular Yeezy sneaker is the Adidas Yeezy Boost 350 Low in the Turtledove colorway.
profitOfShoe <- avgResaleBySneaker %>%
  summarize(Brand, Shoe_Name, Retail_Price, Average_Resale_Price,
            Average_Profit = Average_Resale_Price - Retail_Price) %>%
  unique()
## `summarise()` has grouped output by 'Shoe_Name'. You can override using the
## `.groups` argument.
head(profitOfShoe %>% arrange(desc(Average_Profit)), 1)
## # A tibble: 1 × 5
## # Groups: Shoe_Name [1]
## Shoe_Name Brand Retail_Price Avera…¹ Avera…²
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Air-Jordan-1-Retro-High-Off-White-White Off-White 190 1826. 1636.
## # … with abbreviated variable names ¹Average_Resale_Price, ²Average_Profit
The Nike Off-White Air Jordan 1 in the white colorway has the highest average profit, about $1,636.
head(profitOfShoe %>% arrange(Average_Profit), 1)
## # A tibble: 1 × 5
## # Groups: Shoe_Name [1]
## Shoe_Name Brand Retail_Price Average_Resale_P…¹ Avera…²
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Adidas-Yeezy-Boost-350-V2-Sesame Yeezy 220 264. 44.1
## # … with abbreviated variable names ¹Average_Resale_Price, ²Average_Profit
The Adidas Yeezy Boost 350 V2 in the Sesame colorway has the lowest average profit, about $44.
Hover over a color to see the resale value, name, and retail cost of an Adidas Yeezy sneaker.
The y-axis shows the cumulative profit for all sneakers in the category; hovering over a sneaker's color segment shows the average profit of that shoe.
Hover over a color to see the resale value, name, and retail cost of a Nike Off-White sneaker.
The y-axis shows the cumulative profit for all sneakers in the category; hovering over a sneaker's color segment shows the average profit of that shoe.
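The plotting code is not shown in this report. As a rough sketch of the kind of chart described here, a stacked bar places one colored segment per sneaker on top of the others, so the bar height is the sum of the per-sneaker averages. The data frame `toy_profit` is invented, and using `plotly::ggplotly()` for the hover labels is an assumption about how the interactivity could be added, not a statement of how the original plots were made.

```r
library(ggplot2)

# Hypothetical per-sneaker average profits for a single brand
toy_profit <- data.frame(
  Brand          = "Yeezy",
  Shoe_Name      = c("Shoe-A", "Shoe-B", "Shoe-C"),
  Average_Profit = c(100, 700, 44)
)

# Stacked bar: total height is the sum of the averages,
# each colored segment is one sneaker
p <- ggplot(toy_profit, aes(x = Brand, y = Average_Profit, fill = Shoe_Name)) +
  geom_col() +
  labs(x = NULL, y = "Cumulative average profit ($)")

# Optional hover labels; skipped when plotly is not installed
if (requireNamespace("plotly", quietly = TRUE)) {
  p_interactive <- plotly::ggplotly(p)
}
```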
Now that we have identified the most and least profitable shoes, the next step is to explore which factors, if any, are related to profitability.
Let’s see if there is a relationship between shoe size and profit by running a simple linear regression on shoe size and average profit.
## `summarise()` has grouped output by 'Shoe_Size'. You can override using the
## `.groups` argument.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
From this graph we can see that average profit rises slightly at the largest shoe sizes.
sizeModel <- lm(Avg_Profit ~ Shoe_Size, data = sneakerSizePlotData)
sizeModel
##
## Call:
## lm(formula = Avg_Profit ~ Shoe_Size, data = sneakerSizePlotData)
##
## Coefficients:
## (Intercept) Shoe_Size
## 147.673 9.669
ggplot(sneakerSizePlotData, aes(x = Shoe_Size, y = Avg_Profit)) +
  geom_point() +
  stat_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'
Displayed above is the model that was computed and the graph with a regression line. We can now summarize and determine if there is a relationship.
summary(sizeModel)
##
## Call:
## lm(formula = Avg_Profit ~ Shoe_Size, data = sneakerSizePlotData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.68 -13.37 0.73 16.38 664.46
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 147.67282 0.30536 483.6 <2e-16 ***
## Shoe_Size 9.66894 0.03171 304.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.35 on 99954 degrees of freedom
## Multiple R-squared: 0.4819, Adjusted R-squared: 0.4819
## F-statistic: 9.298e+04 on 1 and 99954 DF, p-value: < 2.2e-16
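Based on this summary, there is a relationship between shoe size and average profit. The slope of 9.67 is highly significant (p < 2e-16), and the model explains about 48% of the variance in average profit (R-squared = 0.4819), so average profit rises by roughly $9.67 per full shoe size. The precision should be read with caution, though. The 99,954 residual degrees of freedom show that the model was fitted to 99,956 rows, one per sale; if Avg_Profit is a per-size average repeated across those rows, the standard errors are understated. The large maximum residual (664.46) also shows that at least one group sits far above the fitted line.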
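The code that builds `sneakerSizePlotData` is not shown, but the 99,954 residual degrees of freedom imply one row per sale. The sketch below assumes the per-size average was copied onto every sale row and compares that model with a regression on per-sale profit. All data are simulated, and `toy`, `repeated`, and `Profit` are hypothetical names.

```r
library(dplyr)

set.seed(1)
# Simulated sales; sizes and profits are invented for illustration
toy <- tibble(
  Shoe_Size = sample(seq(4, 15, by = 0.5), 500, replace = TRUE),
  Profit    = 150 + 10 * Shoe_Size + rnorm(500, sd = 80)
)

# Model 1: per-size average copied onto every sale row
repeated <- toy %>%
  group_by(Shoe_Size) %>%
  mutate(Avg_Profit = mean(Profit)) %>%
  ungroup()
fit_repeated <- lm(Avg_Profit ~ Shoe_Size, data = repeated)

# Model 2: individual sale profit, one observation per sale
fit_sales <- lm(Profit ~ Shoe_Size, data = toy)

# Same rows and same fitted line, but the averaged response
# hides the sale-to-sale noise and shrinks the standard error
se_repeated <- summary(fit_repeated)$coefficients["Shoe_Size", "Std. Error"]
se_sales    <- summary(fit_sales)$coefficients["Shoe_Size", "Std. Error"]
print(c(repeated = se_repeated, per_sale = se_sales))
```

Both models give the same slope, so the choice does not change the estimated trend. It changes how much confidence the standard error and p-value appear to justify.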
Now that we have examined the effect of shoe size on average profit, let’s take a look at buyer region. Following a similar procedure, we can see if region affects the profit of a shoe by running a simple linear regression.
## `summarise()` has grouped output by 'Buy_Region'. You can override using the
## `.groups` argument.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
No clear pattern is visible in the plot, although a trend is hard to define when the x-axis is an unordered category such as region.
regionModel <- lm(Avg_Profit ~ Buy_Region, data = sneakerRegionPlotData)
regionModel
##
## Call:
## lm(formula = Avg_Profit ~ Buy_Region, data = sneakerRegionPlotData)
##
## Coefficients:
## (Intercept) Buy_RegionAlaska
## 183.888 44.636
## Buy_RegionArizona Buy_RegionArkansas
## 56.474 12.643
## Buy_RegionCalifornia Buy_RegionColorado
## 87.538 40.651
## Buy_RegionConnecticut Buy_RegionDelaware
## 17.744 113.763
## Buy_RegionDistrict of Columbia Buy_RegionFlorida
## 60.115 53.984
## Buy_RegionGeorgia Buy_RegionHawaii
## 37.465 100.561
## Buy_RegionIdaho Buy_RegionIllinois
## -10.524 35.857
## Buy_RegionIndiana Buy_RegionIowa
## 21.158 69.955
## Buy_RegionKansas Buy_RegionKentucky
## 14.667 65.855
## Buy_RegionLouisiana Buy_RegionMaine
## 12.868 -27.434
## Buy_RegionMaryland Buy_RegionMassachusetts
## 40.289 35.760
## Buy_RegionMichigan Buy_RegionMinnesota
## 22.145 44.706
## Buy_RegionMississippi Buy_RegionMissouri
## -1.453 25.871
## Buy_RegionMontana Buy_RegionNebraska
## 15.601 13.103
## Buy_RegionNevada Buy_RegionNew Hampshire
## 94.923 31.946
## Buy_RegionNew Jersey Buy_RegionNew Mexico
## 56.419 29.722
## Buy_RegionNew York Buy_RegionNorth Carolina
## 49.741 25.012
## Buy_RegionNorth Dakota Buy_RegionOhio
## 28.955 34.781
## Buy_RegionOklahoma Buy_RegionOregon
## 42.313 79.521
## Buy_RegionPennsylvania Buy_RegionRhode Island
## 26.475 23.189
## Buy_RegionSouth Carolina Buy_RegionSouth Dakota
## 33.013 -6.779
## Buy_RegionTennessee Buy_RegionTexas
## 23.830 23.128
## Buy_RegionUtah Buy_RegionVermont
## 56.187 67.457
## Buy_RegionVirginia Buy_RegionWashington
## 58.204 51.480
## Buy_RegionWest Virginia Buy_RegionWisconsin
## -28.111 40.476
## Buy_RegionWyoming
## -59.363
ggplot(sneakerRegionPlotData, aes(x = Buy_Region, y = Avg_Profit)) +
  geom_point() +
  stat_smooth(method = lm) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
## `geom_smooth()` using formula = 'y ~ x'
ggplot2 cannot draw a regression line here because Buy_Region is a categorical variable, so there is no numeric x-axis to fit a line against. We can still examine the summary of the model.
summary(regionModel)
##
## Call:
## lm(formula = Avg_Profit ~ Buy_Region, data = sneakerRegionPlotData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.089e-08 0.000e+00 0.000e+00 0.000e+00 1.109e-07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.839e+02 1.686e-11 1.090e+13 <2e-16 ***
## Buy_RegionAlaska 4.464e+01 4.914e-11 9.083e+11 <2e-16 ***
## Buy_RegionArizona 5.647e+01 1.942e-11 2.907e+12 <2e-16 ***
## Buy_RegionArkansas 1.264e+01 3.218e-11 3.929e+11 <2e-16 ***
## Buy_RegionCalifornia 8.754e+01 1.706e-11 5.131e+12 <2e-16 ***
## Buy_RegionColorado 4.065e+01 2.051e-11 1.982e+12 <2e-16 ***
## Buy_RegionConnecticut 1.774e+01 2.004e-11 8.856e+11 <2e-16 ***
## Buy_RegionDelaware 1.138e+02 1.972e-11 5.768e+12 <2e-16 ***
## Buy_RegionDistrict of Columbia 6.012e+01 2.764e-11 2.175e+12 <2e-16 ***
## Buy_RegionFlorida 5.398e+01 1.746e-11 3.092e+12 <2e-16 ***
## Buy_RegionGeorgia 3.747e+01 1.884e-11 1.989e+12 <2e-16 ***
## Buy_RegionHawaii 1.006e+02 2.497e-11 4.027e+12 <2e-16 ***
## Buy_RegionIdaho -1.052e+01 3.872e-11 -2.718e+11 <2e-16 ***
## Buy_RegionIllinois 3.586e+01 1.785e-11 2.008e+12 <2e-16 ***
## Buy_RegionIndiana 2.116e+01 2.027e-11 1.044e+12 <2e-16 ***
## Buy_RegionIowa 6.996e+01 2.381e-11 2.938e+12 <2e-16 ***
## Buy_RegionKansas 1.467e+01 2.582e-11 5.681e+11 <2e-16 ***
## Buy_RegionKentucky 6.586e+01 2.347e-11 2.806e+12 <2e-16 ***
## Buy_RegionLouisiana 1.287e+01 2.294e-11 5.609e+11 <2e-16 ***
## Buy_RegionMaine -2.743e+01 3.562e-11 -7.702e+11 <2e-16 ***
## Buy_RegionMaryland 4.029e+01 1.881e-11 2.142e+12 <2e-16 ***
## Buy_RegionMassachusetts 3.576e+01 1.814e-11 1.971e+12 <2e-16 ***
## Buy_RegionMichigan 2.214e+01 1.820e-11 1.216e+12 <2e-16 ***
## Buy_RegionMinnesota 4.471e+01 2.153e-11 2.076e+12 <2e-16 ***
## Buy_RegionMississippi -1.453e+00 3.289e-11 -4.417e+10 <2e-16 ***
## Buy_RegionMissouri 2.587e+01 2.194e-11 1.179e+12 <2e-16 ***
## Buy_RegionMontana 1.560e+01 5.419e-11 2.879e+11 <2e-16 ***
## Buy_RegionNebraska 1.310e+01 2.854e-11 4.591e+11 <2e-16 ***
## Buy_RegionNevada 9.492e+01 2.119e-11 4.480e+12 <2e-16 ***
## Buy_RegionNew Hampshire 3.195e+01 2.870e-11 1.113e+12 <2e-16 ***
## Buy_RegionNew Jersey 5.642e+01 1.766e-11 3.195e+12 <2e-16 ***
## Buy_RegionNew Mexico 2.972e+01 2.910e-11 1.021e+12 <2e-16 ***
## Buy_RegionNew York 4.974e+01 1.709e-11 2.910e+12 <2e-16 ***
## Buy_RegionNorth Carolina 2.501e+01 1.952e-11 1.281e+12 <2e-16 ***
## Buy_RegionNorth Dakota 2.896e+01 4.811e-11 6.018e+11 <2e-16 ***
## Buy_RegionOhio 3.478e+01 1.879e-11 1.851e+12 <2e-16 ***
## Buy_RegionOklahoma 4.231e+01 2.449e-11 1.728e+12 <2e-16 ***
## Buy_RegionOregon 7.952e+01 1.736e-11 4.581e+12 <2e-16 ***
## Buy_RegionPennsylvania 2.648e+01 1.806e-11 1.466e+12 <2e-16 ***
## Buy_RegionRhode Island 2.319e+01 2.567e-11 9.034e+11 <2e-16 ***
## Buy_RegionSouth Carolina 3.301e+01 2.264e-11 1.458e+12 <2e-16 ***
## Buy_RegionSouth Dakota -6.779e+00 5.145e-11 -1.318e+11 <2e-16 ***
## Buy_RegionTennessee 2.383e+01 2.150e-11 1.108e+12 <2e-16 ***
## Buy_RegionTexas 2.313e+01 1.751e-11 1.321e+12 <2e-16 ***
## Buy_RegionUtah 5.619e+01 2.394e-11 2.347e+12 <2e-16 ***
## Buy_RegionVermont 6.746e+01 4.280e-11 1.576e+12 <2e-16 ***
## Buy_RegionVirginia 5.820e+01 1.864e-11 3.122e+12 <2e-16 ***
## Buy_RegionWashington 5.148e+01 1.882e-11 2.736e+12 <2e-16 ***
## Buy_RegionWest Virginia -2.811e+01 3.267e-11 -8.605e+11 <2e-16 ***
## Buy_RegionWisconsin 4.048e+01 2.095e-11 1.932e+12 <2e-16 ***
## Buy_RegionWyoming -5.936e+01 5.944e-11 -9.987e+11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.605e-10 on 99905 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 9.524e+24 on 50 and 99905 DF, p-value: < 2.2e-16
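This summary cannot be used to judge whether region affects profit. An R-squared of 1 and residuals that are zero to within rounding error indicate that Avg_Profit takes a single value per region, so the region indicators reproduce it exactly and the p-values carry no information. The coefficients are simply each region's average profit relative to the baseline region (Alabama, the one level absent from the table, with an intercept of 183.89). They range from -59.36 for Wyoming to 113.76 for Delaware, so average profit does differ across regions descriptively. Testing whether those differences are meaningful would require modeling per-sale profit instead of the regional average.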
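To illustrate why a regression of a per-group average on the grouping variable returns an R-squared of 1, here is a small sketch on simulated data. The region names, the `Profit` column, and the object names are all hypothetical, and the data are generated with no true regional effect.

```r
library(dplyr)

set.seed(2)
# Simulated sales in three made-up regions, with no real regional effect
toy <- tibble(
  Buy_Region = sample(c("North", "South", "West"), 300, replace = TRUE),
  Profit     = rnorm(300, mean = 200, sd = 60)
) %>%
  group_by(Buy_Region) %>%
  mutate(Avg_Profit = mean(Profit)) %>%  # one value per region, repeated
  ungroup()

# The response is constant within each region, so the region
# indicators fit it exactly (summary() may warn about a perfect fit)
fit_avg <- lm(Avg_Profit ~ Buy_Region, data = toy)
r2_avg  <- summary(fit_avg)$r.squared

# Per-sale profit keeps the within-region variation,
# so the F-test compares regions against real noise
fit_sales <- lm(Profit ~ Buy_Region, data = toy)
r2_sales  <- summary(fit_sales)$r.squared
print(c(averaged = r2_avg, per_sale = r2_sales))
print(anova(fit_sales))
```

The averaged model reports a perfect fit even though the simulated regions do not differ, which is why the per-sale model is the one to interpret.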